Textmining and Organization in Large Corpus
Abstract
Nowadays a document corpus commonly contains more than 5000 documents. It is almost impossible for a reader to read through all the documents in such a corpus and find the relevant information in a couple of minutes. In this master's thesis project we propose text clustering as a potential solution for organizing a large document corpus. As a sub-field of data mining, text mining aims to discover useful information from written resources. Text clustering is one topic in text mining: it finds group information in text documents and automatically clusters the documents into the most relevant groups. Representing a document corpus as a term-document matrix is the prevalent preprocessing step in text clustering. If each unique term is taken as a dimension, a corpus of common size may contain more than ten thousand unique terms, which results in extremely high dimensionality. Finding good dimensionality reduction algorithms and suitable clustering methods is the main concern of this thesis project. We mainly compare two dimensionality reduction methods, Singular Value Decomposition (SVD) and Random Projection (RP), and three selected clustering algorithms: K-means, Non-negative Matrix Factorization (NMF) and Frequent Itemset. The selected methods and algorithms are compared on their performance and time consumption. This thesis project shows that K-means and Frequent Itemset can be applied to a large corpus. NMF might need more research on speeding up its convergence.
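The pipeline named in the abstract (term-document matrix, dimensionality reduction, clustering) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `term_document_matrix`, `random_projection` and `kmeans` are hypothetical, tokenization is assumed to be plain whitespace splitting, the matrix is stored with one row per document (the transpose of a term-by-document layout), and only Random Projection and K-means are shown; SVD, NMF and Frequent Itemset are omitted.

```python
import math
import random
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, rows); rows[d][t] is the count of term t in document d."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0.0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = float(count)
        rows.append(row)
    return vocab, rows


def random_projection(rows, k, seed=0):
    """Map each row to k dimensions using a Gaussian random matrix."""
    rng = random.Random(seed)
    d = len(rows[0])
    # Entries are N(0, 1) scaled by 1/sqrt(k), a common choice for RP.
    proj = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)]
            for _ in range(d)]
    return [[sum(row[i] * proj[i][j] for i in range(d)) for j in range(k)]
            for row in rows]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels
```

A typical use would build the matrix from the raw documents, reduce it with `random_projection(rows, k)` for some target dimension `k` much smaller than the vocabulary size, and pass the reduced rows to `kmeans`. The target dimension and the number of clusters are free parameters here; the abstract does not state the values used in the thesis.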
Similar resources
GENIA corpus - a semantically annotated corpus for bio-textmining
MOTIVATION Natural language processing (NLP) methods are regarded as being useful to raise the potential of text mining from biological literature. The lack of an extensively annotated corpus of this literature, however, causes a major bottleneck for applying NLP techniques. GENIA corpus is being developed to provide reference materials to let NLP techniques work for bio-textmining. RESULTS G...
Textmining: Generating association rules from textual data
Textmining is an emerging research area, whose goal is to discover additional information from hidden patterns in unstructured large textual collection. Hence, given a collection of text documents, most approaches of text mining perform knowledge-discovery operations on labels associated with each document, which are usually keywords that represent the result of non-trivial keyword-labeling pro...
Large Sphenoethmoidal Encephalocele Associated with Agenesis of Corpus Callosum and Cleft Palate
Basal encephalocele is a rare craniofacial anomaly. In the present paper we report a 10-year-old boy who presented with cleft palate, congenital nystagmus, and hypertelorism. During preoperative evaluation for cleft palate repair, a pulsatile mass was detected in the pharynx. Magnetic resonance imaging showed sphenoethmoidal type of basal encephalocele and agenesis of corpus callosum. Neurosurgical...
Modeling text with generalizable Gaussian mixtures
We apply and discuss generalizable Gaussian mixture (GGM) models for textmining. The model automatically adapts model complexity for a given text representation. We show that the generalizability of these models depends on the dimensionality of the representation and the sample size. We discuss the relation between supervised and unsupervised learning in text data. Finally, we implement a novel...
A symbolic approach to automatic multiword term structuring
This paper presents a three-level structuring of multiword terms (MWTs) based on lexical inclusion, WordNet similarity and a clustering approach. Term clustering by automatic data analysis methods offers an interesting way of organizing a domain’s knowledge structures, useful for several information-oriented tasks like science and technology watch, textmining, computer-assisted ontology popula...
Publication date: 2005